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Abstract.  A  dedicated  processor  architecture  is  here  introduced  for  multiproces¬ 
sor  array  implementations  of  Jacobi  methods.  Oue  new  processing  element  fac¬ 
tors  arbitrary  rotations  into  products  of  elementary  rotations  whose  angles  are 
exactly  twice  the  CORDIC  elementary  angles.  This  characteristic  permits  com¬ 
plete  concurrency  between  the  evaluation  of  the  Jacobi  rotations  and  their  appli¬ 
cation.  Thus  the  processors  in  the  resulting  arrays  are  almost  never  idle. 
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I.  INTRODUCTION 


Symmetric  eigenvalue  and  singular  value  decompositions  (SVD)  are  often  at 
the  core  of  today’s  signal  processing  techniques.  Because  these  decompositions 
demand  a  large  number  of  computations,  and  because  of  the  stringent 
throughput  requirements  of  adaptive  beamforming,  direction  finding  and  other 
real-time  signal  processing  applications  that  employ  them,  several  parallel  archi¬ 
tectures  tailored  to  their  implementation  have  elsewhere  been  proposed.  These 
architectures  are  briefly  reviewed  in  [0].  A  consequence  of  this  burgeoning 
interest  in  simple  dedicated  parallel  implementations  has  been  the  revival  of  the 
Jacobi  methods,  eclipsed  for  about  twenty  years  by  the  computationally  less 
demanding  QR-factorization  methods  [7],  The  Jacobi  methods  have  the  advan¬ 
tage  of  leading  to  highly  regular  implementations  with  local  communications  and 
simple  control. 

The  Jacobi  method  is  applied  to  a  square  n  X  n  matrix,  either  the  original 
matrix  for  the  eigenvalue  problem  or  the  square  factor  computed  in  a  preparatory 
QR-factorization  step  for  the  SVD  problem  [2].  In  this  paper  the  matrix  is 
assumed  to  be  real.  (Note  however  that  the  Jacobi  approach  is  very  general  and 
can  also  be  used  for  parallel  computation  when  the  matrix  is  complex  [6].)  A  par¬ 
ticular  ordering  scheme  for  the  rotations  in  the  Jacobi  approach  allows  for  a  high 
degree  of  parallelism,  enabling  the  use  of  two-dimensional  arrays  of  processors  to 
implement  a  Jacobi  sweep  in  0{n )  time  [lj,  [2].  Experiments  indicate  that  only 
O(logn)  sweeps  are  needed  for  convergence,  giving  an  O(nlogn)  total  time  for 
the  diagonalization  of  the  n  X  n  matrix. 

The  multiprocessor  implementation  developed  in  [1]  and  [2]  consists  of  a 

square  array  of  -—X~  processors  with  nearest  neighbor  interconnections.  Each 
*  • 

sweep  comprises  a  cyclic  sequence  of  n-1  steps.  At  the  start  of  a  step  the 
current  matrix  to  be  operated  upon  is  already  stored  in  a  distributed  fashion  in 
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the  processors  of  the  array.  The  matrix  is  partitioned  into  2X2  blocks,  and  adja¬ 
cent  blocks  are  assigned  to  adjacent  processors.  The  step  starts  with  the  parallel 
evaluation  by  the  diagonal  processors  of  the  rotations  9  and  9  that  diagonalize 
the  2X2  block  stored  in  each  of  them 

t  cos  9  sin  9  w  a  c  w  cos  9  -  sin  9  \  (  T  ol 

'-sin  9  cos  9 '  '  b  i  *  '  sin  9  cos  9  *  (  0  j)  ’ 

The  angles  9  and  9  are  propagated  to  all  processors  in  the  same  row  and  the 
same  column,  respectively,  as  the  diagonal  processor  that  has  evaluated  them. 
Every  processor  in  the  array  pre-  and  post-multiplies  the  2X2  block  it  currently 
has  in  storage  by  the  rotations  9  and  9  that  it  has  received.  Then,  to  complete 
the  step,  matrix  elements  are  interchanged  between  adjacent  processors  in  such  a 
way,  discovered  in  [1],  as  to  enable  a  complete  sweep  in  only  n  -1  steps. 

The  angles  9  and  9  are  easily  seen  to  satisfy  [2] 

tan (9+9  )  =  and  tan (9-9  )  =  -LzL  . 

a  -  a  a  4-  a 

The  main  role  of  the  diagonal  processors  is  to  extract  9  and  9  from  these  equa¬ 
tions,  while  the  function  of  the  off-diagonal  processors  is  simply  to  apply  the  rota¬ 
tions  defined  by  these  angles.  In  order  to  use  the  array  efficiently,  the  diagonal 
processors  should  be  as  fast  as  the  off-diagonal  processors,  even  though  they  must 
perform  a  more  complex  task.  The  array  efficiency  will  be  further  increased  if 
the  evaluation  of  the  angles  by  the  diagonal  processors  and  the  application  of  the 
rotations  overlap  significantly  in  time.  The  CORDIC  concept  is  here  fully 
exploited  in  order  to  meet  these  goals.  In  fact,  according  to  the  implementation 
here  proposed,  the  evaluation  and  application  of  the  rotations  of  a  given  step 
start  and  finish  almost  simultaneously.  As  a  result  the  efficiency  of  the  array  is 
almost  one,  which  is  a  significant  improvement  over  the  work  presented  in  [3J. 
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In  the  symmetric  case  the  computation*!  burden  of  the  diagonal  processors  is 


reduced,  since  then  c  always  equals  6 ,  which  implies  9=9  and 


tan20 


2ft 

s -4  • 


The  array  itself  may  furthermore  be  reduced  to  a  triangle:  in  the  square  array, 
the  blocks  pertaining  to  a  pair  of  processors  which  are  positioned  symmetrically 
with  respect  to  the  diagonal  are  always  the  transpose  of  each  other.  Since  it  is 
conceptually  simpler,  the  implementation  of  this  important  special  case  will  be 


presented  first. 


D.  SYMMETRIC  EIGENVALUE  PROBLEM 


Some  elementary  facts  from  CORD1C  arithmetic  [10],  [11]  will  first  be 
reviewed,  as  we  shall  build  upon  them  in  order  to  derive  our  implementations. 


CORDIC  arithmetic  hinges  on  a  fundamental  property:  if  a  matrix  function 
J?(0)  can  be  expressed  as  a  matrix  exponential  R(9)=  eAt,  an  additive  decom¬ 
position  of  6  yields  a  multiplicative  decomposition  of  R(9)  [4],  [5].  The  off- 
diagonal  processors  in  the  symmetric  eigenvalue  array  must  apply  plane  rota¬ 


tions, 


d  ia\  _  (  cos  9  sin  9  \ 

.-a  (0)  Lsin0  Cos0'  ’ 


which  satisfy  the  above  property,  since 


R{9)  =  eAt  with  A  =  (  ^  J) 


Consequently,  assuming  the  availability  of  the  encoding  of  9  as  a  set  of  signs 


7,-  =  ±1  such  that  9 =  £7 ,-0,,  where  the  positive  angles  9 ,  form  a  preset 


sequence,  the  rotation  of  a  vector  by  R{9)  can  be  performed  by  applying  the 


sequence  of  elementary  rotations  R  (7,-  0, )  to  that  vector.  It  remains  to  specify 
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how  the  angles  0{  are  selected  to  make  the  elementary  rotations  easy  to  perform. 
Expressing  the  rotation  R  hi  9{ )  in  terms  of  /,•  =  tan#, , 

R  hi  *1 )  *  (J  +  A.  7,-  ti  )/v^det (/  +  A  I;  t, ) , 
and  noting  that  the  scaling  factor  is  independent  of  7,- , 

v/drt(/  +  A  1,  (, )  -  y/TTi?  h  S(li ), 

we  tee  that  the  rotetion  A  (0)*  A  (l,  0, )  mey  be  implemented  by  performing 

i-1 

sequentially  the  product  by  the  p  factors  /  +  A  7,- 1,  and  then  dividing  the  result 

by  the  global  scaling  factor  S  *  (4],  [11].  (Alternatively  the  scaling 

•  —1 

could  be  done  first,  followed  by  the  p  products  by  /  +  A  7 ;•  /,• . )  If  the  elementary 
angles  #,■  are  selected  such  that  the  tangents  t,-  are  integer  powers  of  two,  multi¬ 
plication  of  a  vector  by  I  +  A  %  t{  is  simply  a  pair  of  coupled 
addition/subtractions, 

*,  +1  =  *.  +  %  *.  Vi  .  Vi  +i  =  Vi  ~  7.  ti  *i  , 
where  the  vector  components  x,  and  y,-  are  shifted  by  f,-  and  then  added  or  sub¬ 
tracted  to  y,  or  x,- ,  respectively. 

A  high  performance  implementation  of  that  operation  will  use  two  parallel 
adder/subtractors,  one  for  each  equation,  and  two  barrel  shifters,  for  quick  multi¬ 
plication  by  arbitrary  integer  powers  of  two.  That  architecture,  with  a  minor 
modification,  can  also  perform  the  division  by  the  global  scaling  factor  5.  Indeed 

S~l  can  be  decomposed  into  a  product  S~l  =1*1(1  +  #,  «,  ),  where  =  ±1 

i-l 

and  each  «,  is  an  integer  power  of  two.  Scaling  of  a  vector  by  1  +  6,  a,  is  sim¬ 
ply  a  pair  of  decoupled  addition /subtractions, 

*.+i  =  *,  +  > 


»i+i  —  Vi  +  Sf  Si  yi  , 
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where  the  components  i,  and  y,-  are  shifted  by  and  then  added  or  subtracted 
to  x,  or  y, ,  respectively.  Hence  only  a  multiplexer  has  to  be  added  to  the  basic 
architecture,  to  direct  the  shifted  x,  towards  the  y -adder  and  the  shifted  jr, 
towards  the  x -adder  (coupled  mode)  or  to  direct  the  shifted  x,-  towards  the  x- 
adder  and  the  shifted  y,-  towards  the  y -adder  (decoupled  mode).  (Note  that,  for 
compactness  and  speed,  the  multiplexer  and  the  two  barrel  shifters  could  be 
merged  into  a  single  structure.) 

The  total  time  to  perform  a  rotation  is  independent  of  the  rotation  angle 
and  equals  the  time  to  perform  p  coupled  pairs  of  addition/subtractions  and  q 
decoupled  pairs  of  addition/subtractions.  Since  in  the  above  implementation  a 
pair  of  addition/subtractions,  whether  it  is  coupled  or  decoupled,  requires  the 
same  amount  of  time  -  one  cycle  -  the  total  time  to  perform  a  rotation  is  p  +  q 
cycles.  The  tangents  t{  should  therefore  be  selected  to  minimize  p  +  q  [4]. 
Assuming,  without  loss  of  generality,  that  the  tangents  f,  are  non-increasing, 
that  is  t,-  >  t y  for  1  <  i  <  j  <  p ,  any  rotation  within  a  range  (-0m„,  +0mM) 

can  be  implemented  with  accuracy  e  if  and  only  if  <  ]£]  0}  +  e  and 

j-i 

$i  <  jf]  +  e  ,  f  or  1  <  »  <  p  . 

y-.+i 

These  conditions,  derived  in  [10],  are  easily  seen  to  be  satisfied  if  *,+.>*,  / 2 
for  1  <  «  <  p-1  and  tp  —  €.  The  freedom  in  the  choice  for  <l  +  x  of  either  t, 
or  ft-  /2  may  be  exploited  in  order  to  minimize  p  +  q :  by  properly  choosing  the 
tangent  values  to  be  repeated,  we  may  modify  the  scaling  factor  S  and  reduce 
the  number  q  of  factors  1  +  in  the  expansion  of  its  reciprocal  [4j.  Given 
the  desired  range  and  accuracy,  we  may  achieve  that  minimization  via  a  search 
procedure  such  as  simulated  annealing  [8j.  For  0mu  =  jt/4,  the  range  for  the 
symmetric  eigenvalue  problem  [1],  and  for  the  resolution  e  =  2~33,  our  simulated 


annealing  program  generated  the  sequences 


f,  «  2~2,  2 r*  2 -*  2“4,  2-4,  2“*,  2“* . .  2 "**, 

s,  =s  -2"*,  +2-4,  -r15,  +2’“,  -2-18,  -2*a,  -r*4, 


giving  a  rotation  time  of  just  40  cycles.  The  control  unit  of  each  processor  is  pro¬ 
grammed  to  cycle  through  such  a  sequence. 

In  the  proposed  architecture  for  the  symmetric  eigenvalue  problem,  the  off- 
diagonal  processors  will  each  have  two  processing  units  that  execute  in  parallel 
the  standard  CORDIC  algorithm  presented  above,  with  minimal  number  of  cycles 
p  +q .  In  the  first  phase  of  each  step,  each  processing  unit  rotates  a  column  of 
the  2X2  block  stored  in  the  off-diagonal  processor  by  the  angle  9  whose  encoding 
is  generated  by  the  diagonal  processor  in  the  same  row.  In  the  second  phase  of 
the  step,  each  processing  unit  now  operates  on  a  row  of  the  modified  2X2  block 
and  rotates  it  by  the  angle  9  whose  encoding  has  been  generated  in  the  first  phase 
by  the  diagonal  processor  in  the  same  column. 

While  the  off-diagonal  processors  use  the  above  standard  CORDIC  algorithm 
to  apply  rotations  by  angles  9  encoded  as  sequences  of  signs  7,- ,  1  <  i  <  p ,  the 
diagonal  processors  must  on  the  other  hand  perform  the  more  complex  task  of 
computing  the  encoding  of  the  angle  9  which  satisfies  tan20  —  26  /(a  -  d),  given 
the  entries  (a ,  6  ,  d)  of  their  current  block.  The  first  phase  of  the  steps  is  criti¬ 
cal,  since,  during  that  phase,  rotations  must  be  applied  simultaneously  with  their 
evaluation.  Such  concurrency  may  be  achieved  if  the  encoding  is  generated 
recursively,  that  is,  one  sign  7,-  every  clock  cycle.  Indeed,  although  several  cycles 
may  be  needed  to  propagate  a  bit  along  a  row  of  the  array,  q  cycles  are  available 
for  that  propagation  if  the  off-diagonal  processors  spend  the  first  q  cycles  of  the 
first  phase  performing  the  scaling  part  of  the  rotations.  For  practical  array  sizes, 
this  q -cycle  upper  bound  on  the  propagation  delay  can  be  met  without  much 
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difficulty. 

The  encoding  will  be  generated  on  line  if  the  diagonal  processors  can  per¬ 
form  rotations  by  20;  for  1  <  «  <  p  and,  moreover,  if  the  decomposition  of  20 

«  £  li  ‘20, •  is  generated  recursively.  The  rotation  by  7,-  20;  is 

*— i 


Rhr29;)  =  fih;0;)R 


therefore,  up  to  a  scaling  factor  independent  of  the  sign  % ,  it  is  a  multiplication 
by 

|  l-<,2  1,21,  | 

1-7,  1-1, s  I  ' 

Since  the  tangents  t{  are  integer  powers  of  two,  this  multiplication  may  easily  be 
implemented  with  shifts  and  addition/subtractions.  Specifically,  the  operations 
are 

*.  +l  =  (1  -  O*.  +  ~li  -2t,  Vi  ,  Jf,  +I  =  (1  -  t;2)y;  -  7,  2t;  X;  . 

These  operations  may  be  performed  in  the  same  amount  of  time  as  an  off- 
diagonal  processor  cycle  by  a  processor  with  four  barrel  shifters,  two  linear  arrays 
of  carry  save  adders  (plus  some  inverters)  and  two  adders.  The  barrel  shifters  are 
of  two  different  types,  in  order  to  shift  x ;  and  V;  >n  parallel  by  the  two  amounts 
2 1;  and  t,-2.  The  arrays  of  carry-save  adders  funnel  data  into  the  two  adders. 
Specifically,  they  compress  the  triples  (i, ,  ,  %  21;  y, )  and 

(ya- ,  -t;2y; ,  -7,-  -2t;X; )  into  pairs  of  inputs  for  the  two  adders  which  compute, 
respectively,  x,  +,  and  y,+j.  The  carry-save  adder  delay  is  much  smaller  than  the 
adder  delay:  hence  the  claim  that  a  diagonal  processor  can  perform  the  whole 
multiplication  in  one  cycle.  The  signs  7,-  can  be  generated  recursively  by  using 
the  simple  control  law  (5] 
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li  =  «0»  (*,"*.)  • 

This  law  ensures  that  the  rotation  is  always  performed  in  the  proper  direction,  so 
that,  starting  with  Xj  =  (a  -  d)j 2  and  y  j  =  6 ,  the  absolute  value  of  the  angle 

between  (x, ,  y, )  and  {sign  (a  -d ),  0)  is  always  less  than  £  20,  +  2c. 

/— • 

This  novel  processor  architecture,  introduced  to  implement  the  diagonal  of 
the  symmetric  eigenvalue  array,  will  also  be  used  for  the  processors  in  the  SVD 
array.  The  processor  has  the  flexibility  to  perform  either  the  coupled  pair 

x+  *  (l-at2)z  +  r2ty  ,  y+  =  (1  -at2)y  -  r^tx  , 
or  the  decoupled  pair 

x  +  =  (1  +  <2)x  +  06  2tx  ,  y+  “  (1  +  f2)y  4-  P6-2ty  , 

where  a  and  equal  either  0  or  1,  7  and  S  equal  either  1  or  *1,  and  t  can  be  any 
negative  integer  power  of  two  between  2~2  and  c  (note  that  t  =  2_l  is  not 
included).  The  x  and  y  registers  may  be  loaded  independently  at  the  start  of 
each  cycle.  The  computation  of  (a  -</)/ 2,  which  can  be  seen  to  require  two 
cycles,  and  the  computation  of  a  -  d  and  2 b ,  which  requires  four  cycles  because 
each  computation  needs  two  cycles  and  they  cannot  be  performed  concurrently, 
illustrate  the  limitations  of  the  processor. 

Assume  for  a  moment  that  at  the  start  of  the  first  phase  of  a  step  the  diago¬ 
nal  processors  have  already  set  their  respective  Xj  =  (a  -  d)/2  and  y ,  =  6  ,  and 
that  the  associated  sign,  71(  has  just  been  sent  to  the  off-diagonal  processors  in 
the  corresponding  row.  We  shall  assume  furthermore  that  7|  reaches  all  these 
processors  before  the  end  of  cycle  q .  On  the  first  cycle,  the  diagonal  processors 
perform  the  (unsealed)  rotation  by  7,  20x  and  evaluate  72,  which  is  sent  to  the 
off-diagonal  processors  in  the  corresponding  row.  The  last  sign,  7p ,  is  evaluated 
on  cycle  p-1.  The  off-diagonal  processors  spend  the  first  q  cycles  scaling  the 
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vectors  which  they  are  rotating;  by  doing  the  scaling  iterations  at  the  beginning 
of  the  phase  rather  than  at  its  end,  q  cycles  are  provided  for  the  horizontal  pro¬ 
pagation  of  the  signs  7,- .  On  cycle  q  +1  the  off-diagonal  processors  apply  from 
the  left  the  (unsealed)  rotation,  by  7^,  to  their  respective  block.  They  apply 
the  last  rotation,  by  7,  9? ,  on  cycle  p  +q .  The  diagonal  processors  initiate  on 
cycle  p  the  computation  of  the  updates  of  the  diagonal  elements  a  and  d ,  which 
are  easily  seen  to  satisfy 

a  =  —  +  [-  --— co$2 9+  b  sin2fl ) 

m  i 

i  =  —  cos 20  +  b  sin20 1  . 

2  2 

To  obtain  the  term  between  brackets,  which  is  the  first  component  of  the  vector 
((a  -  d  )/2,  b  )  after  its  rotation  by  29,  the  rotation  of  ((a  -  d  )/2,  6  )  is  completed 
as  follows.  On  cycle  p  apply  the  unsealed  rotation  by  7p  -2 9p .  Then,  for  the 
next  q  cycles,  perform  the  scaling  part  of  the  rotation,  according  to  the  decou¬ 
pled  equations 

*,•+1  =  (1  +  *.-V,  +  Si  -2 «,  X i  ,  yl  +,  =  (1  +  «i2)y{  +  6{  -2s,  y,  . 

The  evaluation  of  the  term  between  brackets  thus  terminates  on  cycle  p  +q . 

In  the  second  phase,  every  off-diagonal  processor  rotates  its  2X2  block  from 
the  right  by  the  angle  9  whose  encoding,  issued  on  the  first  phase  by  the  diagonal 
processor  in  the  same  column,  has  already  been  received.  This  requires  p  +  q 
cycles,  like  the  first  phase.  The  diagonal  processors  spend  the  first  two  cycles 
computing  (a  +  d)f 2,  perform  the  addition  that  gives  a  on  the  next  two  cycles, 
and  effect  on  the  following  two  cycles  the  subtraction  that  yields  d .  The  remain- 
ing  p  +q  -6  cycles  provide,  as  we  shall  see  presently,  more  than  enough  time  to 
let  the  diagonal  processors  take  the  head  start  on  the  first  phase  of  the  step,  as 
we  earlier  assumed. 
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Id  between  two  steps,  the  processors  must  exch&Dge  data.  In  contrast  with 
the  propagation  of  the  single  bits  qr,- ,  which  takes  place  along  horizontal  and 
vertical  directions,  this  exchange  is  performed  along  diagonal  connections.  Each 
processor  has  the  ability  to  exchange  data  with  its  four  neighboring  processors 
along  the  diagonal  directions  [1],[2].  Both  diagonal  elements  within  a  block  are 
exchanged  with  elements  on  the  diagonals  of  those  blocks  which  reside  in  the  two 
neighboring  processors  along  the  same  diagonal  direction;  the  same  holds  for  the 
other  two  elements  in  the  block  with  respect  to  the  other  diagonal  direction.  In 
particular,  the  diagonal  entries  of  the  matrix  are  always  exchanged  between  diag¬ 
onal  processors.  The  interchange  itself  may  take  a  few  cycles;  it  will  require  four 
cycles  assuming  32-bit  precision  and  one  processor  per  VLSI  chip,  with  64  bidirec¬ 
tional  I/O  pins  dedicated  to  this  exchange.  The  diagonal  processors,  which  finish 
updatiug  the  entries  in  their  block  about  p  cycles  ahead  of  the  off-diagonal  pro¬ 
cessors,  can  exchange  their  diagonal  elements  with  a  p  -cycle  lead  time.  Thus  are 
they  afforded  the  time  to  evaluate  the  new  (a  -d  )/ 2  before  the  start  of  the  next 
step. 


HI.  SINGULAR  VALUE  DECOMPOSITION 


The  array  architecture  is  a  square  of 


n 


rt 


vertical  connections,  which  carry  to  the  off-diagonal  processors  the  angle  encod¬ 
ings  computed  by  the  diagonal  processors,  and  with  diagonal  interconnections 
between  neighboring  processors,  which  convey  the  entries  of  the  2X2  blocks 
interchanged  before  each  step.  Both  diagonal  and  off-diagonal  processors  consist 
of  two  processing  units.  The  processing  units  are  all  identical,  and  can  perform 
the  same  functions  as  a  diagonal  processor  of  the  symmetric  eigenvalue  array. 


I 

« 


d 


I 


During  the  first  phase  of  a  step,  the  diagonal  processors  rotate  in  parallel  the 
vectors  (a  -  d ,  b  +  c  )  and  (a  +  d ,  b  -  c  ),  in  order  to  evaluate  recursively  the 
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encodings  of  0+  0  +  9  and  0"  $L6  -  9  as  well  as  to  update  the  diagonal  ele¬ 

ments  within  their  respective  blocks.  The  elementary  angles  for  these  rotations 
are  20, ,  identical  to  the  ones  used  by  the  diagonal  processors  in  the  eigenvalue 
array.  In  order  to  generate  the  encoding  as  soon  as  possible,  the  unsealed  part  of 
the  rotations  must  be  performed  first.  On  the  first  two  cycles  the  two  processing 
units  of  a  diagonal  processor  compute  6  +  e  and  6  -  c ;  the  components  a  -  d 
and  a  +  d  have  already  been  computed  during  the  previous  step.  At  the  end  of 
the  second  cycle,  the  two  processing  units  contain  (*,+  ,y,+  )  =  (a  -  d ,  b  +c) 
and  (*f  ,  V f  )  ”  («  +  d,  b  -  e ),  and  the  signs  7,+  =  $ign  (x,+  y,+  )  and 
7f  =  sign  (if  y f  )  are  generated.  The  operations  effected  in  parallel  in  the  two 
processing  units  on  cycle  i  +2,  1  <  *  <  p ,  are 


tit l  =  (1  -  f,2)*,+  +  7,+-2<,  y,+  ,  Viti  =  (1  -  «,2)y.+ -  7,+-2f,  i,+ 

li+l  =  «*>«  (x.  ti  y,  +i )  , 

and 

*.+i  =  (i-f, *)*,*  +  7f i  =  (!  -  ^2)y."  -  1,_  , 

V+i  =  *«Vn(x,+i  y,+i )  . 


The  pairs  of  bits  (7,+,  7f),  1  <  *  <  p ,  generated  by  a  diagonal  processor,  make 
up  the  encoding  of  (0+,  0").  They  are  sent  via  2-bit  wide  horizontal  and  vertical 
connections  to  all  the  processors  which  are  located  either  on  the  row  or  on  the 
column  of  the  diagonal  processor.  The  last  unsealed  rotations,  performed  on 
cycle  p  +2,  align  the  vectors  (a  -  d ,  b  -I-  e  )  and  (a  +  d ,  b  -  e  )  along  the  (1,  0) 
direction  to  the  best  angular  accuracy  that  a  diagonal  processor  can  achieve. 
While  the  pair  of  bits  (7p++|,  7p“+i)  generated  on  that  cycle  is  of  no  utility,  the 
first  components  of  the  rotated  vectors  are  useful  for  computing  the  updates  7 
and  1  of  the  diagonal  elements.  Indeed,  these  components  are  the  computed 
values,  before  scaling,  of 
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x*  ^  (a  -  i  )cos0*  +  (k  +  c  Jsind*  and  x~  &  (a  +  d  )cos9~  +  (b  -  e  )sin 9~  , 
and  the  updates  of  <  and  d  are  easiiy  shown  to  satisfy 

2  2 

The  scaling  part  of  the  rotations  that  yield  x+  and  x~  begins  on  cycle  p  +3  and 
employs  the  decoupled  equations 

*.♦!  “  U  +  *  +  3i  2*i  *,+  »  fit i  —  (1  +  *,'2)y,  +  +  -2«,  y,+  , 

*,+i  *(1  +  *i2)Zi-  +  6t-  2*i*i~  ,  fi+i  *  (1  +  *,•*)»,' +  ^  *2s,  y,'  . 

Since  the  scaling  requires  9  steps  its  execution  will  spill  over  onto  the  first  two 
cycles  of  the  second  phase. 

During  the  first  phase  the  off-diagonal  processors  apply  rotations  by  9  to  the 
columns  of  their  respective  block.  During  the  first  q  cycles  of  the  phase  each 
off-diagonal  processor  normalizes  the  lengths  of  the  vectors  it  operates  upon,  thus 
giving  the  pairs  (7,+,  7,"),  generated  by  the  diagonal  processor  belonging  to  the 
same  row,  q  -2  cycles  to  reach  an  off-diagonal  processor.  The  significant,  uns¬ 
ealed,  part  of  the  rotations  by  9  is  applied  on  the  next  p  cycles.  The  angle  of  the 
»  th  rotation  increment  is  (7,-+  +  7 f~)0,- ,  hence  the  total  amount  of  rotation  after 
p  iterations  is,  as  it  should  be,  the  half-sum  of  the  decompositions  of  9+  and  9~ 

as  £  7,+,20;  and  £  7f  2*,  • 

•  -1  1 -1 

In  order  to  be  able  to  perform  the  scaling  in  q  iterations,  some  care  is 
needed  when  implementing  the  unsealed  rotations.  Observe  that  the  amount  of 
the  t  th  unsealed  rotation  can  be  +20, ,  -20,  or  simply  0.  Since  in  the  first  two 
cases  the  rotation  increases  the  length  of  the  vector  to  which  it  is  applied  by 
1  +  /,2,  the  nil  rotation  should  increase  the  length  by  the  same  amount.  A  pro¬ 
cessing  unit  can  effect  this  stretching  in  one  cycle  by  multiplying  both  vector 
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components  by  1  +  t{2.  It  is  only  because  of  this  operation  that  the  elementary 
angles  for  the  rotations  have  been  selected  equal  to  20,  instead  of  0, ,  where 
tanff,-  =  ti  is  an  integer  power  of  two.  Indeed  if  the  angles  0,  were  selected  as 
rotation  increments,  the  vector  components  would  have  to  be  multiplied  by 
y/TTp,  which  the  CORDIC  hardware  cannot  do  in  one  cycle.  (A  related  pro¬ 
perty  is  that  2 0p  is  the  smallest  rotation  increment  that  may  be  applied  in  the 
proposed  implementation,  corresponding  to  an  accuracy  of  2c.  Thus,  assuming 
an  identical  number  of  cycles  per  step,  the  accuracy  of  the  Jacobi  rotations  is  one 
bit  less  for  the  SVD  array  than  for  the  eigenvalue  array.) 

We  may  now  recapitulate  more  formally  the  operations  performed  by  the 
off-diagonal  processors  during  the  first  phase.  The  processors  apply  rotations  by  0 
to  the  columns  of  their  respective  block.  Each  processing  unit  within  a  processor 
deals  independently  with  one  of  the  two  columns.  The  q  scaling  iterations  per¬ 
formed  on  a  column  are  of  the  form 

+i  *■  U  +  2*.  *,  ,  y,  +i  ==  (1  +  »i2)Vi  +  Si  2 y,  . 

The  p  unsealed  rotation  iterations  that  follow  satisfy 

*.+ 1  *  (i  -  °i  t,2)*.  +  %+  (i  +  <*i  )U  y,  ,  y,+i  —  (i  - *i *,-2)y,-  -  7;+-(i  +  )<,  *,  , 

where  <r,  =.  sign  (<7,+-'Y,r)  with  7,+  and  7,r  received  from  the  diagonal  processor  in 
the  same  row.  Thus  either  a  nil  rotation  is  performed,  if  7,+  and  7,r  differ,  or  else 
a  rotation  by  7,+-20,  is  applied. 

At  the  beginning  of  the  second  phase  of  the  step,  the  diagonal  processors  ter¬ 
minate  the  scaling  of  the  components  z+  and  x~  and  compute  a  and  d  from 
these  components.  This  requires  a  total  of  only  four  cycles.  Then  the  diagonal 
processors  exchange  their  diagonal  elements  between  neighbors  and  compute  the 
new  a  -  d  and  a  +  d .  Afterwards,  they  merely  wait  till  the  end  of  the  phase  for 
the  off-diagonal  processors  to  finish  their  rotations. 
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During  the  second  phase,  the  off-diagonal  processors  apply  rotations  by  9 
to  the  rows  of  their  respective  blocks.  Each  processing  unit  within  a  processor 
deals  independently  with  one  of  the  two  rows.  The  q  scaling  iterations  per¬ 
formed  on  a  row  are  of  the  form 

*,  +1  =  U  +  *i2)xi  +  *i  ‘2*i  *,  ,  >,  +1  —  (1  +  «,2)y,  +  -2 a,  y,  . 

The  p  unsealed  rotation  iterations  that  follow  satisfy 

*,  + 1  =  (1  +  <*i  +  rii+i  1  -  Oi )/,  y,  ,  y,  +i  =  (1  +  <r,  f,2)y,  -  %+-(l  -  ff,  )f,-  x,  , 

where  <r,-  =  aiyn  (v  *7f)  with  <7,+  and  Tf,"  received  from  the  diagonal  processor  in 
the  same  column.  Thus  either  a  nil  rotation  is  performed,  if  7,+  and  7,'  are 
equal,  or  else  a  rotation  by  /y,+-201  is  applied.  The  second  phase  lasts  p  +q 
cycles,  like  the  first  one. 

The  next  few  cycles  are  used  to  exchange  entries  between  neighboring  pro¬ 
cessors  in  exactly  the  same  way  as  for  the  symmetric  eigenvalue  array.  Note  that 
the  updated  off-diagonal  elements  for  the  diagonal  processors  will  be  available 
only  at  the  end  of  that  exchange;  they  cannot  be  exchanged  in  advance  like  the 
diagonal  elements. 


IV.  DISCUSSION 

A  key  idea  in  our  approach  is  to  simplify  the  diagonal  processors  by  having 
them  generate  simple  rotation  encodings.  The  diagonal  processors  of  the  sym¬ 
metric  eigenvalue  array  consist  of  one  new  processing  element  and  those  of  the 
SVD  array  of  two  such  processing  elements.  This  new  processing  element  is  a 
fairly  simple  machine,  only  slightly  more  complex  than  a  CORDIC  rotor  and 
requiring  less  than  twice  the  CORDIC  rotor  area  for  the  same  cycle  time.  The 
simplification  of  the  diagonal  processors  entails  some  extra  cost  for  the  off- 
diagonal  processors,  which  must  be  able  to  construct  rotations  from  the 
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encodings.  If  that  extra  cost  were  an  increased  processor  complexity,  the 
approach  would  be  unpractical;  there  are  many  more  off-diagonal  processors  than 
diagonal  ones.  Fortunately  the  CORDIC  approach  lets  the  tradeoff  take  place  at 
another  level:  processor  speed. 

For  32-bit  precision  the  diagonal  processors  proposed  can  be  as  fast  if  not 
faster  (and  certainly  much  more  compact)  than  processors  based  on  conventional 
approaches  of  evaluating  the  entries  of  the  rotation  matrix:  using  fast  multiplica¬ 
tion,  division  and  square-root  algorithms,  or  table  look-ups  and  linear  interpola¬ 
tions.  The  off-diagonal  processors,  however,  are  significantly  slower  than  a  fast 
multiplier  based  implementation  of  the  rotation  for  about  the  same  area.  This 
relatively  poor  performance  would  doom  the  whole  approach  if  it  did  not  possess 
a  key  redeeming  feature:  the  rotations  may  be  applied  simultaneously  with 
respect  to  their  evaluation.  Indeed,  while  with  a  traditional  processor  architec¬ 
ture  the  application  of  the  rotations  cannot  start  before  the  complete  evaluation 
of  the  entries  of  the  rotation  matrices,  with  the  proposed  architecture  the  appli¬ 
cation  of  the  rotations  by  the  off-diagonal  processors  starts  at  the  same  instant  as 
their  evaluation.  The  relative  slowness  of  the  off-diagonal  processors  contributes 
little  to  the  total  computation  time:  at  the  end  of  a  step,  the  off-diagonal  proces¬ 
sors  need  about  p  cycles  to  terminate  their  computations  after  the  diagonal  pro¬ 
cessors  have  finished  theirs.  This  time  is  to  be  compared  to  the  time  required, 
with  a  traditional  processor  architecture,  for  propagating  the  cosines  and  sines 
defining  the  rotations  to  the  off-diagonal  processors  and  then  performing  the  rota¬ 
tions.  These  times  are  similar  for  practical  pin  counts,  which  impose  a  fair 
amount  of  sequentiality  in  propagating  the  sines  and  cosines.  Thus  we  claim  that 
our  processor  architecture  leads  to  a  performance  similar  to  what  could  be 
obtained  using  a  more  conventional  architecture  based  on  fast  arithmetic,  while 
offering  the  advantages  of  simpler  control  and  much  simpler  diagonal  processors. 


Two  important  architectural  variants  that  use  the  CORDIC  approach  should 
be  mentioned.  They  both  apply  to  the  SVD  array.  In  the  first  variant  the  off- 
diagonal  processors  are  replaced  by  simpler  pairs  of  CORDIC  rotors,  as  in  the 
symmetric  eigenvalue  array.  The  diagonal  processors,  which  must  then  generate 
the  encodings  for  8  and  9  directly,  employ  the  relations 


tan20  =  2- 


ab  +  ed 

t2+e2-b2_d2 


,  tan20r  =  2- 


ae  +  kd 
+  i2-c2-d2 


Thus,  in  addition  to  the  two  processing  elements  that  generate  the  encodings  for 
0  and  9  given  the  vectors  (a2+e2-62~d2, 2[ab  +cd ))  and 
(a2+62-e2-d2,  2(ac  +bd)),  the  diagonal  processors  possess  fast  multipliers  for 
the  speedy  computation  of  the  components  of  the  vectors.  The  updates  of  the 
diagonal  blocks  are  performed  by  pairs  of  CORDIC  rotors,  as  for  the  off-diagonal 
blocks.  The  array  is  therefore  a  square  of  identical  processors  -  pairs  of  CORDIC 
rotors  •  plus  a  diagonal  of  rather  complex  processors;  it  can  be  as  fast  as  the  ori¬ 
ginal  one  if  the  time  for  the  fast  multiplications  and  the  propagation  of  the  first 
bit  of  the  encoding  of  0  does  not  exceed  q  cycles.  A  second  variant  aims  at 
removing  the  idle  time  that  the  diagonal  processors  experience  during  the  second 
phase  of  each  step.  The  diagonal  processors  are  the  same  as  in  the  first  variant 
and  generate  the  encodings  for  9  and  9  directly.  A  square  array  of  identical 
processors  applies  the  rotations  to  the  diagonal  and  off-diagonal  blocks  simultane¬ 
ously  from  the  left  and  from  the  right.  More  precisely  these  processors  apply  the 
i  th  iteration  of  the  rotation  by  9  and  the  rotation  by  9  simultaneously,  in  effect 
removing  the  second  phase  of  each  step.  The  processors  contain  two  processing 
units  which,  although  they  have  more  complex  adders,  are  still  very  similar  to  the 
ones  in  the  original  SVD  array.  One  processing  unit  operates  on  the  diagonal 
entries  of  the  current  block  and  the  other  one  on  its  off-diagonal  entries.  The 
speed  up  with  respect  to  the  original  array  can  reach  two,  if  the  time  spent  for 
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the  fast  multiplications  and  the  propagation  of  the  first  bit  of  the  encodings  of  9 
and  9  does  not  exceed  q  cycles. 

One  should  be  able  to  use  the  arrays,  which  are  adapted  to  a  specific  prob¬ 
lem  size  n ,  for  the  solution  of  oversized  problems  with  only  a  marginal  loss  in 
efficiency.  Instead  of  a  block  Jacobi  scheme  [2],  a  more  appropriate  procedure  is 
to  map  the  scalar  Jacobi  scheme  on  a  virtual  array  whose  dimensions  match  the 
problem  size  (more  precisely  match  the  first  multiple  of  n  greater  or  equal  to  the 
problem  size)  and  then  fold  that  array  onto  the  real  array.  The  folding  is  just 
like  a  paper  folding  into  squares  of  the  same  size  as  the  real  array  and  defines  the 
allocation  of  the  2X2  blocks  of  the  matrix  onto  the  real  array.  (Each  processor 
could  communicate  with  its  own  memory  chip  holding  the  blocks  allocated  to  it.) 
The  sequencing  in  time  is  such  that,  if  we  unfold  our  folded  paper,  the  squares 
along  the  diagonal  are  processed  first  in  order  to  determine  the  rotation  angles. 
The  exchanges  between  steps  still  involve  only  neighboring  processors.  The  con¬ 
trol  remains  simple  and  the  efficiency  very  high. 
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